This training workshop will cover carrying out a PCA and finding outlier SNPs in population genomic data using the program PCAdapt. This training workshop is presented as part of the 2023 Research Computing Summer Bootcamp at Northeastern University.
Outline
- Launch RStudio app on Discovery
- Install packages in RStudio (this takes ~20 minutes so we’re starting here)
- Brief introduction to PCA’s while our R packages install
- Live coding demonstration (everyone is encouraged to follow along!)
- Practice questions
Getting R set up
Let’s launch Rstudio in the Open OnDemand app here. Please select the following parameters for your RStudio session:
Install R Packages
The first thing we need to do is download a few different packages. You only need to do this the first time you use each package on your machine.
Today we will be using the R package PCAdapt and the required dependencies. While many packages can be downloaded in R using the `install.packages’. We’ll use that below to install the packages that we need.
install.packages("devtools")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
install.packages(c("BiocManager","vcfR","pcadapt"))
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
# type "yes" when prompted
BiocManager::install("qvalue")
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'qvalue'
If you have an error reading “Warning in install.packages : package ?BiocManager? is not available for this version of R” Go to “Help”, then “Check for Updates” and update your R version. You also may need to restart your R.
After you have successfully downloaded all the packages needed for today you can start a new script in RStudio.
What is a PCA?
Principal component analysis: A method to reduce the dimensionality of large datasets and enable visualization of multidimensional data.
In population genomics our data is typically:
Thousands of SNPs from hundreds of individuals
And they usually are used to produce something that relates geographic position to genetic variation and looks for correlations between the two.
With PCA’s for population genomics were interested in patterns of genetic variation across a landscape. A PCA is a great way to reduce the dimensionality of population genomic data (thousands of SNPs for hundreds of individuals) and can be used to identify individual SNPs that shape sample clustering in a PCA.
Why might we care about patterns of genetic variation?
Tells us about population demography, gene flow, and selection
Evolution requires variation in phenotype and genotype
For species of conservation concern: patterns of genetic differentiation can inform conservation strategies.
We can find cool and interesting patterns that help us understand how geography shapes genetic variation
How did we generate the data we will be working with today?
Steps involved in running a PCA
- Data standardization
- PCA is sensitive to variances in the data
- For example, if one site is much more variable than any other
Calculate a covariance matrix where (x, y, z) are individual samples
Compute the eigenvectors and eigenvalues of the covariance matrix This is where we identify the principal components of the data!
Eigenvalues and Eigenvectors
Using the covariance matrix we calculate eigenvectors and eigenvalues. An eigenvector points in the direction it is stretched by vector transformation and the eigenvalue is how much it was stretched.
What is a principal component?
New variables constructed from a mixture of our initial variable.
We can determine the number of Principal Components using a scree plot (shown below) which shows the porportion of variance in our data that is explained by each component.
Data for today’s workshop
Whole genome low coverage sequence from 117 individual sea slugs collected across 8 sites along the California coast. We have ~29k SNPs which were filtered in vcftools for LD and missingness. You’ll be working with two files: a VCF (Variant Call Format) and a metadata file.
Download the data
In Rstudio we will move from the standard ‘console’ to the ‘terminal’ window where we can access the terminal on the compute node that we have running RStudio. Here we will run the command below to download the data that we need for today’s session. This will make a new directory called “rcbootcampPCA_2023” with two files: 5.n190_moreSNPS_popgen_thinned.recode.vcf slug_metadata_sorted_popgen.csv
Run these commands in the ‘terminal’ window of RStudio
curl https://github.com/northeastern-rc/training-PCAdapt/raw/main/rcbootcampPCA_data_2023.tar.gz
#now untar the file
tar -xvzf rcbootcampPCA_data_2023.tar.gz
#check the contents of the file
ls rcbootcampPCA_2023
## % Total % Received % Xferd Average Speed Time Time Time Current
## Dload Upload Total Spent Left Speed
##
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
## tar: Error opening archive: Failed to open 'rcbootcampPCA_data_2023.tar.gz'
## 5.n190_moreSNPS_popgen_thinned.recode.vcf
## positions.txt
## slug_metadata_sorted_popgen.csv
Finding outliers using pcadapt
Today we will achieve two goals, 1) estimate population clustering via PCA, and 2) find outliers based on that clustering. For this you will use the pcadapt package. We will just use a few functions, but you can do a lot with this package; the documentation is found here
I always like to call the different R packages I plan to use at the very top of the script so I just have to call them once:
library(pcadapt)
library(qvalue)
library(vcfR)
library(dplyr)
library(ggplot2)
Lets start by setting a working directory. I like to define the paths to my input files at the top of the script, that way if I want to use a slightly different input file or if I want to share my script it’s easy to see how to change it.
#setwd("rcbootcampPCA_2023/")
# path to the data and meta
vcf.path="5.n190_moreSNPS_popgen_thinned.recode.vcf"
meta.path="slug_metadata_sorted_popgen.csv"
Let’s read in the genetic data from the VCF file. Because VCF is a common file format for SNP data, pcadapt comes with a special function for reading this type of file. You will get a warning message if you run on your own computer, but don’t worry about it.
## 2 variant(s) have been discarded as they are not SNPs.
## Summary:
##
## - input file: 5.n190_moreSNPS_popgen_thinned.recode.vcf
## - output file: /var/folders/p2/dqjjxc1n49z5wsd4xmzzrz500000gp/T//RtmpfXkpld/filedb8c33213509.pcadapt
##
## - number of individuals detected: 117
## - number of loci detected: 29821
##
## 29819 lines detected.
## 117 columns detected.
Now let’s read in the metadata. This is a CSV (comma separated values) file, which is a spreadsheet-style format that can be output from programs like Excel. The columns are separated with commas. Once you load the file, take a minute to examine the contents.
meta <- read.csv(meta.path)
head(meta)
## Sample pops Length.bp. Q30... GC... TotalReads TotalBases slug
## 1 BOL1 BO 150;150 72.19;79.42 36.49;36.33 17783142 2667471300 TRUE
## 2 BOL10 BO 150;150 94.84;94.44 35.02;35.01 59158990 8873848500 TRUE
## 3 BOL11 BO 150;150 80.60;88.28 36.01;35.79 49648684 7447302600 TRUE
## 4 BOL12 BO 150;150 93.80;87.67 35.30;35.31 60522412 9078361800 TRUE
## 5 BOL13 BO 150;150 93.98;93.64 34.56;34.48 72585508 10887826200 TRUE
## 6 BOL14 BO 150;150 90.53;91.08 35.06;34.97 66929838 10039475700 TRUE
## cov eggmass site lat lon Q30_1 Q30_2
## 1 2.963857 L Bolinas Bay 37.90614 -122.6507 90.65 93.95
## 2 9.859832 L Bolinas Bay 37.90614 -122.6507 98.42 98.31
## 3 8.274781 L Bolinas Bay 37.90614 -122.6507 93.70 96.62
## 4 10.087069 L Bolinas Bay 37.90614 -122.6507 98.15 96.23
## 5 12.097585 L Bolinas Bay 37.90614 -122.6507 98.15 98.08
## 6 11.154973 L Bolinas Bay 37.90614 -122.6507 97.06 97.34
IMPORTANT NOTE: Here the samples in the VCF are in the same order as the samples in the metadata file. Triple check ahead of time that this is the case otherwise you may get some weird results!
Now that we’ve read in all the data, the first step for pcadapt is to make a PCA. Here’s how we do it in pcadapt:
x <- pcadapt(input=genos,K=20)
plot(x,option="screeplot")
We can see from this plot that the vast majority of genetic variation can be explained by the first two principal components axes. And that really most of the variation is explained by the first 4 PC axes. So let’s focus on those 4.
res <- pcadapt(genos, K = 4)
plot(res)
par(mfrow = c(2, 2))
for (i in 1:4)
plot(res$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
We can now plot the PC axes and color by collection site (or population).
attach(meta)
plot(x, option = "scores", i = 1, j = 2, pop = site)
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
plot(x, option = "scores", i = 1, j = 3, pop = site)
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
plot(x, option = "scores", i = 1, j = 4, pop = site)
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
plot(x, option = "scores", i = 2, j = 3, pop = site)
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
plot(x, option = "scores", i = 3, j = 4, pop = site)
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
And maybe have some fun with colors
my.colors<-colorRampPalette(c("#440154FF", "#7AD151FF"))
length(unique(meta$pops)) # how many collection sites do we have
## [1] 8
my.colors(8) #get 8 colors
## [1] "#440154" "#4B1E53" "#533C53" "#5B5A52" "#627752" "#6A9551" "#72B351"
## [8] "#7AD151"
barplot(1:8, col=my.colors(8)) # visualize them in a barplot
Use those colors to make our PCA
plot(x, option = "scores", i = 1, j = 2, pop = site, col=my.colors(8))
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
Find the outliers
This gives us a visual idea of which SNPs might be associated with population differences. If we want to identify statistical outliers, we first need to adjust the p-values:
qval <- qvalue(x$pvalues)$qvalues
outliers <- which(qval<0.1)
length(outliers)
## [1] 508
Sure! There are some outliers. We can and should also input a higher qvalue cutoff.
padj <- p.adjust(x$pvalues, method = "BH")
alpha <- 0.01
outliers <- which(padj < alpha)
length(outliers)
## [1] 315
Manhattan plot
We can plot this as a manhattan plot
plot(res, option="manhattan")
Manipulating the input file
Say you’re interested in the pattern of genetic variation in the populations with population TB (Tomales Bay) removed. How might we modify our input file to remove samples from that population so we can see what the relationship is between the other collection sites?
We may also want to now remove those outliers and replot out PCA. Both of these questions manipulate the vcf file dataframe either by removing individuals or by removing SNPs and are useful to know how to do.
We could do this with the vcf file directly with vcftools (via the flags –exclude or –keep). Or we can keep everything in our R script. Let’s start by removing the TB individuals
Removing individuals from a pcadapt object
You may recall that when we first read in our vcf file with the ‘read.pcadpt’ function that the output told us something about the dimensions of the data. We’ll rerun that command here this time telling it to output a “matrix”.
genos <- read.pcadapt(vcf.path,type=c("vcf"), type.out="matrix")
## 2 variant(s) have been discarded as they are not SNPs.
## Summary:
##
## - input file: 5.n190_moreSNPS_popgen_thinned.recode.vcf
## - output file: /var/folders/p2/dqjjxc1n49z5wsd4xmzzrz500000gp/T//RtmpfXkpld/filedb8c74a55060.pcadapt
##
## - number of individuals detected: 117
## - number of loci detected: 29821
##
## 29819 lines detected.
## 117 columns detected.
So the object genos contains 117 columns. Those columns correspond to the rows in our meta file. Let’s look at our meta file (below), and we can see below that the TB sites are individuals between rows 103 and 117.
which(meta$pops == "TB")
## [1] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
So let’s use this information to rerun read.pcadapt()
and remove the TB individuals. We also need to use the transpose
function t() to get read.pcadapt to correctly identify
individuals from SNPs.We also need to use the transpose function to get
read.pcadapt to correctly identify individuals from SNPs. We’ll then
replot it by population to see how the PCA looks.
genos_filt<-read.pcadapt(t(genos[-c(103:117),]))
attach(meta)
noTB <- pcadapt(genos_filt, K = 4)
plot(noTB, option = "scores", i = 1, j = 2, pop = site[-c(103:117)], col=my.colors(7))
Look at that! How easy. :) While we could have done this in vcftools, and it may be necessary to do so for other programs it’s a great idea to try and keep all of the manipulations that you make to a file recorded in a script so those changes can be recreated and clearly documented.
Note
it's easy to mess up file formats in some cases and care should be taken to maintain what the next program you're running expects, but also, it's not broke until it is...so feel free to try things! You may find a clever solution. One of the 'fun factors' in bioinformatics is that you can always go back to your raw data or previous file etc. So try things (and never directly change the raw data!).
Let’s apply this lesson on removing individuals to remove SNPs which you can try to do in the first exercise below.
PCAdapt Exercises
- Now that you’ve seen one example of manipulating the input file for pcadapt. Try removing the outlier SNPs that we identified from the ‘genos’ input file. And then recreate our PCA without those outliers. hint the object ‘outliers’ contains a numeric list of the row numbers. If you know how to subset a dataframe by row numbers (it’s not too different than what we did with columns) it should be pretty straight forward. If you don’t know, feel free to ask a neighbor or your friendly google! The solution is visible below.
Solution
#look at outliers
head(outliers)
## [1] 18 80 132 264 332 350
genos_filtSNPs<-read.pcadapt(t(genos[,-c(unlist(outliers))]))
attach(meta)
noSNPs <- pcadapt(genos_filtSNPs, K = 4)
plot(noSNPs, option = "scores", i = 1, j = 2, pop = site, col=my.colors(8))
# So you will notice that even by removing the outliers we still get the same overall pattern in the PCA. We would need to lower the threshold for effect size and select more outliers to see the overall pattern in the PCA shift. You can experiment with that here with this code!
- Perhaps we are particularly interested in SNPs associated with
latitude. A first step might be to ask whether either of our PC axes
represent latitudinal variation.
x$scoresgives you the PC loadings for each sample along the first and second PC axes. Combine these with the metadata to see whether either PC axes are correlated with latitude (for our purposes you can just visualize the relationship, you don’t have to run the statistics.)
Solution
plot(x$scores[,1],as.factor(meta$lat),pch=19,col=as.factor(meta$pops))
plot(x$scores[,2]~meta$lat,pch=19,col="gray")
- If we are only interested in SNPs associated with a single principal component, we can get pcadapt to give us correlation scores for each PC axis independently. Using the guidance provided in the documentation here make a manhattan plot of SNPs driving variation along the first principal component only
Solution
x_cw <- pcadapt(genos,K=2,method ="componentwise")
plot(x_cw,option="manhattan",K=2)